New Word Detection Based on an Improved PMI Algorithm for Enhancing Segmentation System
DU Liping, LI Xiaoge, YU Gen, LIU Chunli, LIU Rui
Acta Scientiarum Naturalium Universitatis Pekinensis    2016, 52 (1): 35-40.   DOI: 10.13209/j.0479-8023.2016.024

This paper presents an unsupervised method for identifying new internet words in a large-scale web corpus. The method combines an improved Point-wise Mutual Information (PMI) measure, the PMIk algorithm, with some basic rules, and can recognize new internet words of length 2 to n (where n is any number as needed). In experiments on a 257 MB Baidu Tieba corpus, the proposed system achieves a precision of 97.39% when the parameter of the PMIk algorithm is set to 10, an increase of 28.79% over the PMI method. The results show that the proposed system is effective and efficient at detecting new words in a large-scale web corpus. The discovered new words were then compiled into a user dictionary, which was loaded into ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System). In an experiment on a 10 KB Baidu Tieba corpus, precision, recall, and F-measure improved by 7.93%, 3.73%, and 5.91%, respectively, compared with ICTCLAS alone. This result shows that new word discovery can significantly improve segmentation performance on web corpora.
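
The abstract does not give the PMIk formula. As a rough illustration only, the sketch below assumes the commonly used definition PMI^k(x, y) = log2(p(x, y)^k / (p(x) p(y))), which reduces to ordinary PMI at k = 1. The function name `pmi_k`, the toy input, and the probability estimates are hypothetical and are not taken from the paper; the paper's rules and its extension to candidates longer than two units are not modeled here.

```python
import math
from collections import Counter

def pmi_k(tokens, k=1):
    """Score adjacent token pairs with an assumed PMI^k definition:
    log2(p(x, y)**k / (p(x) * p(y))). With k=1 this is plain PMI."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi          # relative frequency of the adjacent pair
        p_x = unigrams[x] / n_uni    # relative frequency of the left unit
        p_y = unigrams[y] / n_uni    # relative frequency of the right unit
        scores[(x, y)] = k * math.log2(p_xy) - math.log2(p_x) - math.log2(p_y)
    return scores

# Toy character sequence (illustrative only, not the Baidu Tieba corpus).
tokens = list("abababcd")
plain = pmi_k(tokens, k=1)   # plain PMI
raised = pmi_k(tokens, k=3)  # joint probability raised to the third power
```

On this toy input, plain PMI ranks the pair ("c", "d"), which occurs once, above ("a", "b"), which occurs three times, while k = 3 reverses that order. Raising the joint probability to a power is one way to offset PMI's known bias toward rare pairs, which may be part of why a larger parameter value helps in the reported experiments.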
